On a robust document classification approach using TF-IDF scheme with learned, context-sensitive semantics
Abstract
Document classification is a well-known task in the information retrieval domain and relies on various indexing schemes to map documents into a form that a classification system can consume. Term Frequency-Inverse Document Frequency (TF-IDF) is one such class of term-weighting functions, used extensively for document representation. A major drawback of this scheme is that it ignores key semantic links between words and/or word meanings and compares documents based solely on word frequencies. The majority of current approaches that try to address this issue either rely on alternate representation schemes or are based on probabilistic models. We use a non-probabilistic approach to build a robust document classification system, which relies on enriching the classical TF-IDF scheme with context-sensitive semantics using a neural-net-based learning component.
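The drawback described in the abstract can be illustrated with a minimal sketch of classical TF-IDF. The abstract does not say which TF-IDF variant is used, so the weighting here (length-normalized term frequency times log(N / df)), the whitespace tokenizer, the helper names `tfidf_vectors` and `cosine`, and the three toy sentences are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / doc_length) * log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: documents 0 and 1 are near-paraphrases built from
# synonyms; documents 0 and 2 share surface words but describe different events.
docs = [
    "the car was repaired at the garage",
    "the automobile was fixed at the workshop",
    "the car was sold at the garage",
]
vecs = tfidf_vectors(docs)
sim_synonyms = cosine(vecs[0], vecs[1])  # only zero-weight common words overlap
sim_shared = cosine(vecs[0], vecs[2])    # "car" and "garage" overlap
print(sim_synonyms, sim_shared)
```

In this toy corpus the paraphrased pair scores zero similarity while the pair that merely shares surface words scores higher. This is the kind of purely frequency-based comparison that the proposed semantic enrichment aims to correct.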
Similar resources
Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification
We provide a simple but novel supervised weighting scheme for adjusting term frequency in tf-idf for sentiment analysis and text classification. We compare our method to baseline weighting schemes and find that it outperforms them on multiple benchmarks. The method is robust and works well on both snippets and longer documents.
One-Class SVMs for Document Classification
We implemented versions of the SVM appropriate for one-class classification in the context of information retrieval. The experiments were conducted on the standard Reuters data set. For the SVM implementation we used both a version of Schölkopf et al. and a somewhat different version of one-class SVM based on identifying “outlier” data as representative of the second-class. We report on experim...
Utilizing corpus statistics for Hindi word sense disambiguation
Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense of a polysemous word in a given context. This paper compares three WSD algorithms for Hindi WSD based on corpus statistics. The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn weights of Content Words (CWs). These weights are used in the disambig...
News Recommendation Using Semantics with the Bing-SF-IDF Approach
Traditionally, content-based news recommendation is performed by means of the cosine similarity and the TF-IDF weighting scheme for terms occurring in news messages and user profiles. Semantics-driven variants like SF-IDF additionally take into account term meaning by exploiting synsets from semantic lexicons. However, semantics-based weighting techniques are not able to handle – often crucial –...
Character-Based Text Classification using Top Down Semantic Model for Sentence Representation
Despite the success of deep learning on many fronts, especially image and speech, its application in text classification is often still not as good as a simple linear SVM on an n-gram TF-IDF representation, especially for smaller datasets. Deep learning tends to emphasize sentence-level semantics when learning a representation with models like recurrent neural network or recursive neural network,...
Publication date: 2009